9 research outputs found

    Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

    Get PDF
    Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms have been and are being developed with the aim of accelerating this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations of widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under Marie Sklodowska-Curie Grant 749516, and in part by the Spanish Juan de la Cierva Grant IJCI-2017-33511. Peer Reviewed. Postprint (published version).
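
    The sketch below illustrates the selection problem the paper studies: asking cuDNN itself to benchmark its convolution algorithms for one layer configuration and report the fastest. It is a minimal example, not code from the paper; the layer dimensions are placeholders and error checking is omitted. Switching the descriptors to CUDNN_DATA_HALF and the math type to CUDNN_TENSOR_OP_MATH is what enables the Tensor Core path discussed for 16-bit FP.

        #include <cudnn.h>
        #include <cstdio>

        int main() {
            // Placeholder layer: batch 32, 64 input channels, 56x56 maps, 128 3x3 filters.
            int n = 32, c = 64, h = 56, w = 56, k = 128, r = 3, s = 3;

            cudnnHandle_t handle;
            cudnnCreate(&handle);
            cudnnTensorDescriptor_t xDesc, yDesc;
            cudnnFilterDescriptor_t wDesc;
            cudnnConvolutionDescriptor_t conv;
            cudnnCreateTensorDescriptor(&xDesc);
            cudnnCreateTensorDescriptor(&yDesc);
            cudnnCreateFilterDescriptor(&wDesc);
            cudnnCreateConvolutionDescriptor(&conv);

            cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);
            cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, k, c, r, s);
            cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                            CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
            // For FP16 data, CUDNN_TENSOR_OP_MATH would enable Tensor Cores here.
            cudnnSetConvolutionMathType(conv, CUDNN_DEFAULT_MATH);

            int on, oc, oh, ow;
            cudnnGetConvolution2dForwardOutputDim(conv, xDesc, wDesc, &on, &oc, &oh, &ow);
            cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, on, oc, oh, ow);

            // Time every forward algorithm for this configuration; results come
            // back sorted by execution time.
            cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
            int found = 0;
            cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, conv, yDesc,
                                                 CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &found, perf);
            printf("fastest algorithm: %d (%.3f ms)\n", perf[0].algo, perf[0].time);

            cudnnDestroyConvolutionDescriptor(conv);
            cudnnDestroyFilterDescriptor(wDesc);
            cudnnDestroyTensorDescriptor(xDesc);
            cudnnDestroyTensorDescriptor(yDesc);
            cudnnDestroy(handle);
            return 0;
        }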

    A symbolic emulator for shuffle synthesis on the NVIDIA PTX code

    Get PDF
    Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU's parallel architecture. Directive-based programming models, such as OpenACC, are one such method that easily enables parallel computing simply by adding annotations to code loops. Such abstract models, however, often prevent programmers from making additional low-level optimizations to take advantage of the advanced architectural features of GPUs, because the actual generated computation is hidden from the application developer. This paper describes and implements a novel, flexible optimization technique that operates by inserting a code emulator phase at the tail end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis, substituting dynamic information and thus allowing further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thus enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating warp-level shuffle instructions, which are difficult to use even for advanced GPU programmers. Lastly, evaluating our tool with a benchmark suite and complex application code, we provide a detailed study assessing the benefits of shuffle instructions across four generations of GPU architectures. We are funded by the EPEEC project from the European Union's Horizon 2020 research and innovation program under grant agreement No. 801051 and the Ministerio de Ciencia e Innovación-Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033). This work has been partially carried out on the ACME cluster owned by CIEMAT and funded by the Spanish Ministry of Economy and Competitiveness project CODEC-OSE (RTI2018-096006-B-I00). Peer Reviewed. Postprint (published version).
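
    For context, the sketch below is a hand-written example of the warp-level shuffle optimization the tool synthesizes automatically: a reduction that exchanges values register-to-register with __shfl_down_sync instead of going through shared memory. The kernel and its names are illustrative and not taken from the paper.

        #include <cstdio>
        #include <vector>

        __inline__ __device__ float warp_reduce_sum(float val) {
            // Tree reduction within a warp: values move register-to-register
            // via shuffles, avoiding shared-memory round trips.
            for (int offset = warpSize / 2; offset > 0; offset /= 2)
                val += __shfl_down_sync(0xffffffff, val, offset);
            return val;
        }

        __global__ void sum_kernel(const float *in, float *out, int n) {
            float v = 0.0f;
            for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
                 i += gridDim.x * blockDim.x)
                v += in[i];
            v = warp_reduce_sum(v);
            if ((threadIdx.x & (warpSize - 1)) == 0)
                atomicAdd(out, v);           // one atomic per warp, not per thread
        }

        int main() {
            const int n = 1 << 20;
            std::vector<float> h(n, 1.0f);   // expected sum: n
            float *in, *out;
            cudaMalloc((void **)&in, n * sizeof(float));
            cudaMalloc((void **)&out, sizeof(float));
            cudaMemcpy(in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemset(out, 0, sizeof(float));
            sum_kernel<<<256, 256>>>(in, out, n);
            float result = 0.0f;
            cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
            printf("sum = %.0f (expected %d)\n", result, n);
            cudaFree(in);
            cudaFree(out);
            return 0;
        }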

    ecoHMEM: Improving object placement methodology for hybrid memory systems in HPC

    Get PDF
    Recent byte-addressable persistent memory (PMEM) technology offers capacities comparable to storage devices and access times much closer to DRAM than other non-volatile memory technologies. To mitigate the remaining performance gap with DRAM, DRAM and PMEM are usually combined. Users can either manage the placement of data across the different memory spaces in software or use the DRAM as a cache for the virtual address space of the PMEM. We present a novel methodology for automatic object-level placement, including efficient runtime object matching and bandwidth-aware placement. Our experiments leveraging Intel® Optane™ Persistent Memory show performance ranging from matching to greatly exceeding that of state-of-the-art software and hardware solutions, attaining over 2x runtime improvement in miniapplications and over 6% in OpenFOAM, a complex production application. This paper received funding from the Intel-BSC Exascale Laboratory SoW 5.1, the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 749516, the EPEEC project from the European Union's Horizon 2020 research and innovation program under grant agreement No. 801051, the DEEP-SEA project from the European Commission's EuroHPC program under grant agreement 955606, and the Ministerio de Ciencia e Innovación-Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033). Peer Reviewed. Postprint (author's final draft).
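
    As a point of reference for what the methodology automates, the sketch below shows manual object-level placement with the open-source memkind library on a machine that exposes Optane PMEM as KMEM DAX NUMA nodes. The object sizes and the hot/cold split are illustrative assumptions, not taken from the paper; ecoHMEM's contribution is deriving such placements automatically from profiling data rather than hard-coding them.

        #include <memkind.h>
        #include <cstdio>

        int main() {
            const size_t n = 1 << 26;                  // illustrative object sizes
            // Bandwidth-critical object kept in DRAM.
            double *hot = static_cast<double *>(
                memkind_malloc(MEMKIND_DEFAULT, n * sizeof(double)));
            // Large, rarely touched object placed in persistent memory
            // exposed as a KMEM DAX NUMA node.
            double *cold = static_cast<double *>(
                memkind_malloc(MEMKIND_DAX_KMEM, n * sizeof(double)));
            if (!hot || !cold) {
                fprintf(stderr, "allocation failed\n");
                return 1;
            }
            for (size_t i = 0; i < n; ++i) { hot[i] = 1.0; cold[i] = 2.0; }
            printf("%.1f %.1f\n", hot[0], cold[0]);
            memkind_free(MEMKIND_DEFAULT, hot);
            memkind_free(MEMKIND_DAX_KMEM, cold);
            return 0;
        }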

    DMR API: Improving cluster productivity by turning applications into malleable

    Get PDF
    [EN] Adaptive workloads can change the configuration of their jobs on the fly, in terms of the number of processes. To carry out these job reconfigurations, we have designed a methodology which enables a job to communicate with the resource manager and, through the runtime, to change its number of MPI ranks. The collaboration between the workload manager (aware of the job queue and the resource allocation) and the parallel runtime (able to transparently handle the processes and the program data) is crucial for our throughput-aware malleability methodology. Hence, when a job triggers a reconfiguration, the resource manager will check the cluster status and return the appropriate action: i) expand, if there are spare resources; ii) shrink, if queued jobs can be initiated; or iii) none, if no change can improve the global productivity. In this paper, we describe the internals of our framework and demonstrate how it reduces the global workload completion time while providing a more efficient use of the underlying resources. For this purpose, we present a thorough study of adaptive workload processing, showing the detailed behavior of our framework in representative experiments. (C) 2018 Elsevier B.V. All rights reserved. Iserte Agut, S.; Mayo Gual, R.; Quintana Ortí, ES.; Beltrán, V.; Peña Monferrer, AJ. (2018). DMR API: Improving cluster productivity by turning applications into malleable. Parallel Computing. 78:54-66. https://doi.org/10.1016/j.parco.2018.07.006
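
    To make the expand action concrete, the sketch below uses plain MPI dynamic process management (not the DMR API itself) to grow a job: the original ranks spawn additional processes and merge with them into a single communicator. The number of extra ranks is a placeholder; a real malleable application would also redistribute its data and coordinate the reconfiguration with the resource manager, which is what the framework handles. Compile with an MPI wrapper such as mpicxx.

        #include <mpi.h>
        #include <cstdio>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            MPI_Comm parent, work;
            MPI_Comm_get_parent(&parent);

            if (parent == MPI_COMM_NULL) {
                // Original job: "expand" by spawning two extra copies of this binary.
                const int extra = 2;                       // placeholder value
                MPI_Comm inter;
                MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, extra, MPI_INFO_NULL, 0,
                               MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
                MPI_Intercomm_merge(inter, 0, &work);      // original ranks get low ranks
            } else {
                // Spawned ranks: join the already running job.
                MPI_Intercomm_merge(parent, 1, &work);
            }

            int rank, size;
            MPI_Comm_rank(work, &rank);
            MPI_Comm_size(work, &size);
            printf("rank %d of %d after reconfiguration\n", rank, size);
            // Application data would be redistributed over 'work' here.

            MPI_Comm_free(&work);
            MPI_Finalize();
            return 0;
        }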

    Particle-in-cell simulation using asynchronous tasking

    Get PDF
    Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data-flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community, and their actual advantages when applied to non-trivial, real-world HPC applications are still not well understood. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations, exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near-perfect scaling on 48 cores. Peer Reviewed. Postprint (author's final draft).
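
    The sketch below conveys the flavor of asynchronous tasking applied to a PIC-like loop using OpenMP task dependences; it is a toy illustration with placeholder work and an assumed tile decomposition, not code from the paper, which targets a production EM-PIC application. Independent tiles proceed concurrently, synchronized only through their declared data flow rather than global barriers. Compile with OpenMP enabled, e.g. -fopenmp.

        #include <cstdio>

        int main() {
            const int ntiles = 64;
            static double current[64] = {0.0};

            #pragma omp parallel
            #pragma omp single
            for (int t = 0; t < ntiles; ++t) {
                // Push the particles of tile t and deposit their current.
                #pragma omp task depend(out: current[t]) firstprivate(t)
                current[t] = 0.5 * t;                     // placeholder work

                // Update the fields of tile t only once its current is ready.
                #pragma omp task depend(inout: current[t]) firstprivate(t)
                current[t] += 1.0;                        // placeholder work
            }
            // All tasks are guaranteed to have completed at the end of the
            // parallel region.
            printf("current[0] = %.1f, current[63] = %.1f\n", current[0], current[63]);
            return 0;
        }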

    Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models

    Get PDF
    [EN] Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use can introduce two problems: an increase in the total cost of ownership and their underutilization, because not all codes match their architecture. Remote accelerator virtualization frameworks address those problems. In particular, rCUDA provides transparent access to any graphics processing unit installed in a cluster, reducing the number of accelerators and increasing their utilization ratio. Joining these two technologies, directive-based programming models and rCUDA, is thus highly appealing. In this work, we study the integration of OmpSs and OpenACC with rCUDA, describing and analyzing several applications over three different hardware configurations that include two InfiniBand interconnects and three NVIDIA accelerators. Our evaluation reveals favorable performance results, showing low overhead and similar scaling factors when using remote accelerators instead of local devices. The researchers from the Universitat Jaume I de Castelló were supported by Universitat Jaume I research project P11B2013-21, project TIN2014-53495-R, a Generalitat Valenciana grant, and FEDER. The researcher from the Barcelona Supercomputing Center (BSC-CNS) was supported by the European Commission (HiPEAC-3 Network of Excellence, FP7-ICT 287759), the Intel-BSC Exascale Lab collaboration, the IBM/BSC Exascale Initiative collaboration agreement, Computación de Altas Prestaciones VI (TIN2012-34557), and the Generalitat de Catalunya (2014-SGR-1051). This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DE-AC02-06CH11357. The initial version of rCUDA was jointly developed by Universitat Politècnica de València (UPV) and Universitat Jaume I de Castellón (UJI) until 2010. This initial development was later split into two branches; part of the UPV version was used in this paper. The development of the UPV branch was supported by Generalitat Valenciana under Grants PROMETEO 2008/060 and Prometeo II 2013/009. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. Castelló-Gimeno, A.; Peña Monferrer, AJ.; Mayo Gual, R.; Planas, J.; Quintana Ortí, ES.; Balaji, P. (2018). Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models. The Journal of Supercomputing. 74(11):5628-5642. https://doi.org/10.1007/s11227-016-1791-y
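
    As a reminder of how little the directive-based source changes when the accelerator is remote, the sketch below is a minimal OpenACC loop of the kind studied in the paper; under rCUDA the same offloaded region can execute on a remote GPU. It is an illustrative example, not one of the evaluated applications, and assumes an OpenACC compiler (e.g. nvc++ -acc or pgcc -acc).

        #include <cstdio>

        int main() {
            const int n = 1 << 20;
            static float x[1 << 20], y[1 << 20];
            for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

            // The compiler generates the device kernel and the data movement;
            // whether the GPU is local or provided remotely by rCUDA is
            // transparent to this code.
            #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
            for (int i = 0; i < n; ++i)
                y[i] += 2.5f * x[i];

            printf("y[0] = %.1f\n", y[0]);
            return 0;
        }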

    Improving the User Experience of the rCUDA Remote GPU Virtualization Framework

    Get PDF
    Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. Remote CUDA (rCUDA) was recently introduced as a software solution to address the high acquisition costs and energy consumption of GPUs that constrain further adoption of this technology. Specifically, rCUDA is a middleware that allows a reduced number of GPUs to be transparently shared among the nodes in a cluster. Although the initial prototype versions of rCUDA demonstrated its functionality, they also revealed concerns with respect to usability, performance, and support for new CUDA features. In response, in this paper we present a new rCUDA version that (1) improves usability by including a new component that allows an automatic transformation of any CUDA source code so that it conforms to the needs of the rCUDA framework, (2) consistently features low overhead when using remote GPUs thanks to an improved new communication architecture, and (3) supports multithreaded applications and CUDA libraries. As a result, for any CUDA-compatible program, rCUDA now allows the use of remote GPUs within a cluster with low overhead, so that a single application running in one node can use all GPUs available across the cluster, thereby extending the single-node capability of CUDA. Copyright © 2014 John Wiley & Sons, Ltd. This work was funded by the Generalitat Valenciana under Grant PROMETEOII/2013/009 of the PROMETEO program phase II. The author from Argonne National Laboratory was supported by the US Department of Energy, Office of Science, under Contract No. DE-AC02-06CH11357. The authors are also grateful for the generous support provided by Mellanox Technologies. Reaño González, C.; Silla Jiménez, F.; Castelló Gimeno, A.; Peña Monferrer, AJ.; Mayo Gual, R.; Quintana Ortí, ES.; Duato Marín, JF. (2015). Improving the User Experience of the rCUDA Remote GPU Virtualization Framework. Concurrency and Computation: Practice and Experience. 27(14):3746-3770. https://doi.org/10.1002/cpe.3409
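
    The multi-GPU loop below illustrates the single-node view that rCUDA extends: it is ordinary CUDA code with no rCUDA-specific calls, and under the virtualization framework cudaGetDeviceCount() can report GPUs served by remote nodes, so the same loop spreads work across the cluster. It is an illustrative sketch, not code from the paper, and error checking is omitted.

        #include <cstdio>

        __global__ void scale(float *v, int n, float a) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) v[i] *= a;
        }

        int main() {
            int ndev = 0;
            cudaGetDeviceCount(&ndev);        // local GPUs, or remote ones under rCUDA
            const int n = 1 << 20;
            for (int d = 0; d < ndev; ++d) {
                cudaSetDevice(d);
                float *v;
                cudaMalloc((void **)&v, n * sizeof(float));
                cudaMemset(v, 0, n * sizeof(float));
                scale<<<(n + 255) / 256, 256>>>(v, n, 2.0f);
                cudaDeviceSynchronize();
                cudaFree(v);
            }
            printf("work dispatched to %d GPU(s)\n", ndev);
            return 0;
        }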

    Boosting the performance of remote GPU virtualization using InfiniBand Connect-IB and PCIe 3.0

    Full text link
    © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. [EN] A clear trend has emerged involving the acceleration of scientific applications by using GPUs. However, the capabilities of these devices are still generally underutilized. Remote GPU virtualization techniques can help increase GPU utilization rates, while reducing acquisition and maintenance costs. The overhead of using a remote GPU instead of a local one is introduced mainly by the difference in performance between the internode network and the intranode PCIe link. In this paper we show how using the new InfiniBand Connect-IB network adapters (attaining throughput similar to that of the most recently emerged GPUs) boosts the performance of remote GPU virtualization, reducing the overhead to a mere 0.19% in the application tested. This work was funded by the Generalitat Valenciana under Grant PROMETEOII/2013/009 of the PROMETEO program phase II. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (SC-21), under Contract No. DE-AC02-06CH11357. Authors from the Universitat Politècnica de València and Universitat Jaume I are grateful for the generous support provided by Mellanox Technologies. Reaño González, C.; Silla Jiménez, F.; Peña Monferrer, AJ.; Shainer, G.; Schultz, S.; Castelló Gimeno, A.; Quintana Orti, ES.... (2014). Boosting the performance of remote GPU virtualization using InfiniBand Connect-IB and PCIe 3.0. In 2014 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. 266-267. doi:10.1109/CLUSTER.2014.6968737
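
    The sketch below measures the intranode baseline the paper compares against: host-to-device bandwidth over PCIe using pinned memory, which is the figure the internode path (network plus the remote node's PCIe link) has to approach for the remote-GPU overhead to stay small. It is an illustrative micro-benchmark, not code from the paper; the transfer size is a placeholder and error checking is omitted.

        #include <cstdio>

        int main() {
            const size_t bytes = 256ull << 20;            // 256 MiB transfer
            float *h, *d;
            cudaMallocHost((void **)&h, bytes);           // pinned host buffer
            cudaMalloc((void **)&d, bytes);

            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);
            cudaEventRecord(t0);
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("H2D bandwidth over PCIe: %.2f GB/s\n", bytes / ms / 1e6);

            cudaFreeHost(h);
            cudaFree(d);
            return 0;
        }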

    ETP4HPC’s SRA 5 strategic research agenda for High-Performance Computing in Europe 2022: European HPC research priorities 2023-2027

    No full text
    This document feeds research and development priorities developed by the European HPC ecosystem into EuroHPC's Research and Innovation Advisory Group, with the aim of defining the HPC technology research Work Programme and the calls for proposals included in it, to be launched from 2023 to 2026. This SRA also describes the major trends in the deployment of HPC and HPDA methods and systems, driven by economic and societal needs in Europe, taking into account the changes expected in the technologies and architectures of the expanding underlying IT infrastructure. The goal is to draw a complete picture of the state of the art and the challenges for the next three to four years, rather than to focus on specific technologies, implementations, or solutions. Peer Reviewed. Article signed by 140 authors. Postprint (published version).